# CSC 2224: Parallel Computer Architecture and Programming Main Memory Fundamentals

#### Prof. Gennady Pekhimenko, University of Toronto, Fall 2020

The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU

#### **Review #4**

- RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
  - Vivek Seshadri et al., MICRO 2013

#### Why Is Memory So Important? (Especially Today)

#### **The Performance Perspective**

• "It's the Memory, Stupid!" (Richard Sites, MPR, 1996)



Mutlu+, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," HPCA 2003.

# The Energy Perspective




#### **Communication Dominates Arithmetic**

Dally, HiPEAC 2015



#### A memory access consumes ~1000X the energy of a complex addition

## **The Reliability Perspective**

- Data from all of Facebook's servers worldwide
- Meza+, "Revisiting Memory Errors in Large-Scale Production Data Centers," DSN'15.



[Figure: plot with x-axis "Chip density (Gb)"]

#### **The Security Perspective**



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

#### Why Is DRAM So Slow?

# Motivation (by Vivek Seshadri)

Conversation with a friend from Stanford



## **Understanding DRAM**



Solutions proposed by our research

## Outline

- 1. What is DRAM?
- 2. DRAM Internal Organization
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013, Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

#### What is DRAM?

#### **DRAM** – Dynamic Random Access Memory

#### Array of Values



- READ *address*
- WRITE *address*, value

Accessing any location takes the same amount of time

Data needs to be constantly refreshed
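As a rough illustration of the interface above, DRAM can be modeled as an array of values with read and write operations, whose contents are lost unless they are refreshed in time. This is a minimal sketch: the class name `ToyDRAM`, the `retention_ticks` parameter, and the tick-based notion of time are assumptions made for illustration and do not come from the lecture.

```python
class ToyDRAM:
    """Toy model of DRAM: an array of values that must be refreshed."""

    def __init__(self, size, retention_ticks=64):
        self.values = [0] * size
        self.retention_ticks = retention_ticks  # hypothetical retention window
        self.age = [0] * size  # ticks since each cell was last written or refreshed

    def write(self, address, value):
        """WRITE address, value"""
        self.values[address] = value
        self.age[address] = 0

    def read(self, address):
        """READ address; fails if the cell was not refreshed in time."""
        if self.age[address] > self.retention_ticks:
            raise RuntimeError(f"cell {address} lost its data")
        return self.values[address]

    def tick(self, n=1):
        """Advance time by n ticks."""
        self.age = [a + n for a in self.age]

    def refresh(self):
        """Rewrite every cell that still holds valid data."""
        self.age = [0 if a <= self.retention_ticks else a for a in self.age]


mem = ToyDRAM(size=8, retention_ticks=4)
mem.write(3, 42)
mem.tick(3)
mem.refresh()        # refreshed in time, so the value survives
mem.tick(3)
print(mem.read(3))   # 42
```

Every address costs the same in this model, which mirrors the "random access" property; real DRAM also restores a cell when it is read, which the sketch leaves out.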

#### **DRAM in Today's Systems**



#### Why DRAM? Why not some other memory?

## Von Neumann Model



Memory performance is important

#### **Factors that Affect Choice of Memory**

- 1. Speed
  - Should be reasonably fast compared to processor
- 2. Capacity
  - Should be large enough to fit programs and data
- 3. Cost
  - Should be cheap

# Why DRAM?



**Access Latency** 

#### Is DRAM Fast Enough?

Processor: 3 GHz, 2 instructions / cycle

Commodity DRAM: 50 ns access latency

In the time one DRAM request takes, the processor could execute 300 instructions.

Parallelism

| Work per request | Memory access |
|------------------|---------------|
| 300 Instructions | Request       |
| 300 Instructions | Request       |
| 300 Instructions | Request       |
| 300 Instructions | Request       |

Independent programs

Served in parallel?
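The 300-instruction figure follows directly from the numbers on this slide: a 3 GHz clock completes 3 cycles per nanosecond, so a 50 ns access spans 150 cycles, and at 2 instructions per cycle that is 300 instructions. A small sketch of the arithmetic (the function name is a hypothetical helper):

```python
def instructions_per_access(clock_ghz, instructions_per_cycle, latency_ns):
    """Instructions the processor could issue during one memory access.

    A clock of f GHz completes f cycles per nanosecond.
    """
    cycles = clock_ghz * latency_ns
    return cycles * instructions_per_cycle


print(instructions_per_access(clock_ghz=3, instructions_per_cycle=2, latency_ns=50))  # 300
```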

#### Outline

1. What is DRAM?

2. DRAM Internal Organization

- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013, Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

#### **DRAM Organization**



#### **DRAM Cell Array: Mat**



sense amplifier

# **Memory Element (Cell)**

Component that can be in at least two states



Can be electrical, magnetic, mechanical, etc.

DRAM  $\rightarrow$  Capacitor

#### **Capacitor – Bucket of Electric Charge**



## **DRAM Chip**



# **Divide and Conquer**





# Outline

#### 1. What is DRAM?

#### 2. DRAM Internal Organization

- DRAM Cell
- DRAM Array
- DRAM Bank
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

## **DRAM Cell Read Operation**






# Outline

#### 1. What is DRAM?

#### 2. DRAM Internal Organization

- DRAM Cell
- DRAM Array
- DRAM Bank
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013; Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

## Problem



#### **Cost Amortization**



#### **DRAM Array Operation**



# Outline

#### 1. What is DRAM?

#### 2. DRAM Internal Organization

- DRAM Cell
- DRAM Array
- DRAM Bank

#### 3. Problems and Solutions

- Latency (Tiered-Latency DRAM, HPCA 2013; Adaptive-Latency DRAM, HPCA 2015)
- Parallelism (Subarray-level Parallelism, ISCA 2012)

#### **DRAM Bank**



How to build a DRAM bank from a DRAM array?

## **DRAM Bank: Single DRAM Array?**



#### **DRAM Bank: Collection of Arrays**



#### **DRAM Operation: Summary**



#### **DRAM Chip Hierarchy**



**Collection of Subarrays** 

#### Outline

1. What is DRAM?

2. DRAM Internal Organization

#### 3. Problems and Solutions

- Latency (Tiered-Latency DRAM, HPCA 2013;
   Adaptive-Latency DRAM, HPCA 2015)
- Parallelism (Subarray-level Parallelism, ISCA 2012)

#### **Factors That Affect Performance**

- 1. Latency
  - How fast can DRAM serve a request?

- 2. Parallelism
  - How many requests can DRAM serve in parallel?
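The two factors combine in a simple way: if each request takes a fixed latency and a fixed number of requests can be in flight at once, the peak request rate is the number in flight divided by the latency. The sketch below assumes this idealized model; the 50 ns latency is the commodity-DRAM figure used earlier in the lecture, while the degrees of parallelism are hypothetical.

```python
def requests_per_second(latency_ns, parallel_requests):
    """Peak request rate if `parallel_requests` are always in flight,
    each taking `latency_ns` (an idealized model with no contention)."""
    return parallel_requests * 1e9 / latency_ns


for in_flight in (1, 2, 8):  # hypothetical degrees of parallelism
    rate = requests_per_second(latency_ns=50, parallel_requests=in_flight)
    print(f"{in_flight} in flight: {rate / 1e6:.0f} M requests/s")
```

Under this model, halving latency and doubling parallelism have the same effect on the peak rate, which is why the lecture treats them as the two levers.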

#### **DRAM Chip Hierarchy**



**Collection of Subarrays** 

## Outline

- 1. What is DRAM?
- 2. DRAM Internal Organization
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013;
     Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)

## Subarray Size: Rows/Subarray



#### Subarray Size vs. Access Latency



#### Shorter Bitlines => Faster access



#### Smaller subarrays => lower access latency

#### Subarray Size vs. Chip Area

Large Subarray



**Smaller Subarrays** 



#### Smaller subarrays => larger chip area
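The trade-off can be sketched with a toy model, under the assumption that every subarray needs its own stripe of sense amplifiers and that bitline length (and hence access latency) grows with the number of cells on a bitline. The function name, the bank size of 32768 rows, and the stripe height of 16 cell-row equivalents are all hypothetical values chosen only to show the trend.

```python
def subarray_tradeoff(total_rows, rows_per_subarray, sense_amp_rows):
    """Toy model of one bank split into equal subarrays.

    sense_amp_rows: height of one sense-amplifier stripe, expressed in
    cell-row equivalents (a hypothetical parameter).
    """
    num_subarrays = total_rows // rows_per_subarray
    overhead_rows = num_subarrays * sense_amp_rows
    return {
        "subarrays": num_subarrays,
        "cells_per_bitline": rows_per_subarray,       # proxy for access latency
        "area_overhead": overhead_rows / total_rows,  # extra area relative to the cells
    }


for rows in (512, 128, 32):
    result = subarray_tradeoff(total_rows=32768, rows_per_subarray=rows, sense_amp_rows=16)
    print(rows, result)
```

Shrinking the subarray shortens the bitline but multiplies the number of sense-amplifier stripes, so the area overhead rises as the latency proxy falls.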

#### **Chip Area vs. Access Latency**






How to enable low latency without high area overhead?





#### **Tiered-Latency DRAM**

Far Segment

- Higher access latency
- Higher energy/access

Near Segment

+ Lower access latency
+ Lower energy/access

Map frequently accessed data to near segment
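One simple way to picture "map frequently accessed data to near segment" is to count row accesses and place the most frequently accessed rows in the near segment. The sketch below is an illustrative policy only, not the mechanism from the paper; the function names, the trace, and the 10 ns / 20 ns latencies are hypothetical.

```python
from collections import Counter


def pick_near_segment_rows(access_trace, near_capacity):
    """Return the set of rows accessed most often, up to near_capacity rows."""
    counts = Counter(access_trace)
    return {row for row, _ in counts.most_common(near_capacity)}


def average_latency_ns(access_trace, near_rows, near_ns, far_ns):
    """Average access latency given which rows live in the near segment."""
    total = sum(near_ns if row in near_rows else far_ns for row in access_trace)
    return total / len(access_trace)


trace = [7, 7, 7, 3, 3, 9, 7, 3, 1]  # hypothetical sequence of row addresses
near = pick_near_segment_rows(trace, near_capacity=2)
print(sorted(near))  # [3, 7]
print(average_latency_ns(trace, near, near_ns=10, far_ns=20))
print(average_latency_ns(trace, set(), near_ns=10, far_ns=20))
```

With rows 3 and 7 in the near segment, seven of the nine accesses see the lower latency, so the average drops well below the all-far case.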

## **Results Summary**



Tiered-Latency DRAM



#### Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu

Published in the proceedings of the 19<sup>th</sup> IEEE International Symposium on High Performance Computer Architecture, 2013

#### **DRAM Stores Data as Charge**

Three steps of charge movement

- 1. Sensing
- 2. Restore
- 3. Precharge



#### **DRAM Charge over Time**



Why does DRAM need the extra timing margin?

#### **Two Reasons for Timing Margin**

- 1. Process Variation
  - DRAM cells are not equal
  - Leads to extra timing margin for cells that can store large amount of charge
- 2. Temperature Dependence

#### **DRAM Cells are Not Equal**





Different size → Different charge → Different latency

Large variation in cell size → Large variation in charge → Large variation in access latency

#### **Two Reasons for Timing Margin**

- 1. Process Variation
  - DRAM cells are not equal
  - Leads to *extra timing margin* for cells that can store large amount of charge
- 2. Temperature Dependence
  - DRAM leaks more charge at higher temperature
  - Leads to extra timing margin when operating at low temperature

#### Charge Leakage $\propto$ Temperature



Cells store small charge at high temperature and large charge at low temperature → Large variation in access latency

#### **DRAM Timing Parameters**

- DRAM timing parameters are dictated by the worst case
  - The smallest cell with the smallest charge in all DRAM products
  - Operating at <u>the highest temperature</u>
- Large timing margin for the common case
   → Can lower latency for the common case

#### DRAM Testing Infrastructure



## **Obs 1. Faster Sensing**

Typical DIMM at Low Temperature



115 DIMM characterization

More charge → Strong charge flow → Faster sensing

Timing (tRCD): 17% ↓, No Errors

Typical DIMM at Low Temperature  $\rightarrow$  *More charge*  $\rightarrow$  *Faster sensing* 

## **Obs 2. Reducing Restore Time**

Typical DIMM at Low Temperature



Larger cell & Less leakage → Extra charge

No need to fully restore charge

115 DIMM characterization

- Read (tRAS): 37% ↓
- Write (tWR): 54% ↓
- No Errors

Typical DIMM at Low Temperature → More charge → Restore time reduction

## **Obs 3. Reducing Precharge Time**

Typical DIMM at Low Temperature





Sense amplifier

Precharge: setting the bitline to half-full charge



Typical DIMM at Low Temperature → More charge → Precharge time reduction

## **Adaptive-Latency DRAM**

- Key idea
  - Optimize DRAM timing parameters online
- Two components
  - DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  - System monitors DRAM temperature and uses appropriate DRAM timing parameters
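The two components can be sketched as a per-DIMM profile table plus a lookup keyed on the current temperature. This is a minimal sketch under stated assumptions: the baseline timings are assumed DDR3-like values, the 55 °C threshold is hypothetical, and the reductions (17% for tRCD, 37% for tRAS, 54% for tWR) are the figures reported in this lecture for a typical DIMM at low temperature, applied here only for illustration.

```python
# Assumed worst-case baseline timings in nanoseconds (DDR3-like, for illustration).
STANDARD_NS = {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0}

# Reductions reported in the lecture for a typical DIMM at low temperature.
REDUCTION = {"tRCD": 0.17, "tRAS": 0.37, "tWR": 0.54}


def reduced(timings, reduction):
    """Apply a fractional reduction to each timing parameter."""
    return {name: round(ns * (1 - reduction[name]), 2) for name, ns in timings.items()}


# Hypothetical profile for one DIMM: (maximum temperature in Celsius, timings).
PROFILE = [(55.0, reduced(STANDARD_NS, REDUCTION))]


def select_timings(temp_c, profile=PROFILE, fallback=STANDARD_NS):
    """Pick the timing set profiled for the lowest temperature bound that
    still covers temp_c; fall back to worst-case timings otherwise."""
    for max_temp_c, timings in sorted(profile, key=lambda entry: entry[0]):
        if temp_c <= max_temp_c:
            return timings
    return fallback


print(select_timings(40.0))  # reduced timings for a cool DIMM
print(select_timings(85.0))  # worst-case timings when the DIMM is hot
```

Falling back to the standard timings whenever the temperature exceeds every profiled bound keeps the sketch conservative: a reduced timing set is only used where the profile says it is reliable.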

#### **Real System Evaluation**

[Figure: average performance improvement (%) of AL-DRAM on a real system, single-core and multi-core, for soplex, mcf, milc, libq, lbm, gems, copy, s.cluster, gups, and the non-intensive, intensive, and all-35-workload averages; labeled values: 14.0% and 10.4%]

AL-DRAM provides high performance improvement, greater for multi-core workloads

#### **Summary: AL-DRAM**

- Observation
  - DRAM timing parameters are dictated by the worst-case cell (smallest cell at highest temperature)
- Our Approach: Adaptive-Latency DRAM (AL-DRAM)
  - Optimizes DRAM timing parameters for *the common case* (typical DIMM operating at low temperatures)
- Analysis: Characterization of 115 DIMMs
  - Great potential to *lower DRAM timing parameters* (17–54%) without any errors
- Real System Performance Evaluation
  - Significant *performance improvement* (14% for memory-intensive workloads) without errors (33 days)

# Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case

Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu

Published in the proceedings of the 21<sup>st</sup> International Symposium on High Performance Computer Architecture, 2015

## Outline

- 1. What is DRAM?
- 2. DRAM Internal Organization
- 3. Problems and Solutions
  - Latency (Tiered-Latency DRAM, HPCA 2013;
     Adaptive-Latency DRAM, HPCA 2015)
  - Parallelism (Subarray-level Parallelism, ISCA 2012)



#### **Increasing Number of Banks?**



Adding more banks  $\rightarrow$  Replication of shared structures

Replication  $\rightarrow$  Cost

How to improve available parallelism within DRAM?

## **Our Observation**

Local to a subarray



#### **Subarray-Level Parallelism**



#### **Subarray-Level Parallelism: Benefits**



**Subarray-Level Parallelism** 
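A toy timing model can illustrate why overlapping work across subarrays helps. It assumes that row activation and precharge are local to a subarray and can overlap across different subarrays of the same bank, while column access uses shared structures and is serialized. This is an idealized upper bound for illustration, not the exact mechanism from the paper; the function names and the timing values are hypothetical.

```python
def serialized_ns(num_requests, t_activate, t_column, t_precharge):
    """Baseline: requests to one bank are served one after another."""
    return num_requests * (t_activate + t_column + t_precharge)


def overlapped_ns(num_requests, t_activate, t_column, t_precharge):
    """Idealized subarray-level overlap: every request targets a different
    subarray, activations and precharges overlap, and only column accesses
    (which use shared structures) are serialized."""
    return t_activate + num_requests * t_column + t_precharge


timings = dict(t_activate=15, t_column=10, t_precharge=15)  # hypothetical, in ns
for n in (1, 2, 4):
    print(n, serialized_ns(n, **timings), overlapped_ns(n, **timings))
```

For a single request the two models agree; the gap grows with the number of requests that map to different subarrays of the same bank.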

## **Results Summary**

Commodity DRAM



#### Subarray-Level Parallelism



**Energy Consumption** 

#### A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM

#### Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Published in the proceedings of the 39<sup>th</sup> International Symposium on Computer Architecture, 2012

#### **Review #4**

- RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
  - Vivek Seshadri et al., MICRO 2013
